How would I make this faster? Parsing Word/sorting by heading [on hold]

Posted by Doof12 on Stack Overflow
Published on 2013-06-25T23:21:00Z Indexed on 2013/06/26 4:21 UTC

Filed under: python | regex | ms-word | pywin32 | heading

Currently it takes about 3 minutes to run through a single 53-page Word document. Hopefully you all have some advice about speeding up the process.

Code:

 import win32com.client as win32
 from glob import glob
 import io
 import os
 import re
 from collections import namedtuple
 from collections import defaultdict
 import pprint

 raw_files = glob('*.docx')

 word = win32.gencache.EnsureDispatch('Word.Application')
 word.Visible = False
 oFile = io.open("rawsort.txt", "w+", encoding="utf-8")  # text dump

 doccat = []
 style = None  # most recent heading style seen; defined up front so the bold branch can use it
 for f in raw_files:
     # Word may resolve relative paths against its own working directory, so pass an absolute path
     doc = word.Documents.Open(os.path.abspath(f))
     doc.ConvertNumbersToText()
     print doc.Paragraphs.Count
     for x in xrange(1, doc.Paragraphs.Count + 1):  # visit every paragraph by index
         oText = doc.Paragraphs(x)
         if not oText.Range.Tables.Count > 0:  # paragraph is outside any table
             # note: the '.' in this pattern is unescaped, so it matches any character, not only a literal dot
             results = re.match(r'(?P<number>(([1-3]*[A-D]*[0-9]*)(.[1-3]*[0-9])+))', oText.Range.Text)
             stylematch = re.match(r'Heading \d', oText.Style.NameLocal)
             if results is not None and oText.Style is not None and stylematch is not None:
                 number = results.group('number')
                 doccat.append((oText.Style.NameLocal, oText.Range.Text[:len(number)], oText.Range.Text[len(number):]))
                 style = oText.Style.NameLocal
         else:  # paragraph is inside a table
             if oText.Range.Font.Bold == True:
                 # list.append() takes a single argument, so store a (style, text) tuple
                 doccat.append((style, oText.Range.Text))
     doc.Close(False)  # ConvertNumbersToText() changed the document; close without saving

 oFile.write(unicode(doccat))
 oFile.close()
 word.Quit()
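
The number/title split in the loop above can be tried on plain strings, without Word, which makes it easier to check the pattern separately from the COM calls. This is only a minimal sketch: `split_heading` and the sample strings are hypothetical, and the pattern is copied from the question unchanged (including its unescaped `.`).

```python
import re

# Same pattern as in the question, compiled once. The '.' is left unescaped
# as in the original, so it matches any character, not only a literal dot.
HEADING_NUMBER = re.compile(r'(?P<number>(([1-3]*[A-D]*[0-9]*)(.[1-3]*[0-9])+))')


def split_heading(text):
    """Return (number, title) if text starts with a heading number, else None."""
    match = HEADING_NUMBER.match(text)
    if match is None:
        return None
    number = match.group('number')
    return number, text[len(number):]


# Hypothetical paragraph strings standing in for oText.Range.Text.
samples = ['1.2 Scope', 'A.1 Glossary', 'Introduction']
results = [split_heading(s) for s in samples]
```

Here `results` holds a `(number, title)` tuple for the first two samples and `None` for the third, which has no leading number.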

The loop over the paragraphs obviously takes most of the time. Is there some way of identifying and appending the matching paragraphs without going through every paragraph?
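
One direction that is often suggested for loops like this (not measured here) is to walk the collection with a plain `for` loop instead of indexing it with `doc.Paragraphs(x)`, to read each property once into a local variable, and to check the style name before fetching the text. The sketch below is an illustration under assumptions, not a drop-in replacement: `collect_headings` and the `Fake*` classes are hypothetical names, the fakes only mimic the `Style.NameLocal` and `Range.Text` attributes used in the question so that the logic runs without Word, and the table and bold handling from the original is left out.

```python
import re
from collections import namedtuple

HEADING_STYLE = re.compile(r'Heading \d')
# Pattern copied from the question, unchanged.
HEADING_NUMBER = re.compile(r'(?P<number>(([1-3]*[A-D]*[0-9]*)(.[1-3]*[0-9])+))')

# Hypothetical stand-ins for the Word COM objects, for running without Word.
FakeStyle = namedtuple('FakeStyle', 'NameLocal')
FakeRange = namedtuple('FakeRange', 'Text')
FakeParagraph = namedtuple('FakeParagraph', 'Style Range')


def collect_headings(paragraphs):
    """Return (style, number, title) tuples for numbered heading paragraphs."""
    found = []
    for para in paragraphs:                  # single pass, no lookups by index
        style_name = para.Style.NameLocal    # read the property once
        if not HEADING_STYLE.match(style_name):
            continue                         # not a heading: skip fetching the text
        text = para.Range.Text               # read the property once
        match = HEADING_NUMBER.match(text)
        if match:
            number = match.group('number')
            found.append((style_name, number, text[len(number):]))
    return found


fake_doc = [
    FakeParagraph(FakeStyle('Heading 1'), FakeRange('1.2 Scope')),
    FakeParagraph(FakeStyle('Normal'), FakeRange('1.3 body text')),
    FakeParagraph(FakeStyle('Heading 2'), FakeRange('Untitled')),
]
headings = collect_headings(fake_doc)
```

With a real document, the assumption is that `doc.Paragraphs` can be passed in place of `fake_doc`; whether this actually shortens the 3-minute run would need to be timed on the document in question.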
